- Title
- A new feature selection approach based on proximity graphs and evolutionary computation
- Creator
- Abu Zaher, Amer
- Relation
- University of Newcastle Research Higher Degree Thesis
- Resource Type
- thesis
- Date
- 2017
- Description
- Research Doctorate - Doctor of Philosophy (PhD)
- Description
- Feature selection is an important step for generating new knowledge from datasets that originate from high-throughput biotechnologies. The requirements posed by highdimensional data arising in several bioinformatics’ algorithmic pipelines require new ideas to be tested. This research aims at reducing the dimensionality of a dataset by a search process which iteratively removing features that do not compromise the classification of a set of samples. To accomplish that, this work present a new type of supervised and multivariate feature selection approach that incorporates the use of proximity graphs. Three proximity graphs are studied in this work: the minimum spanning tree (MST), the K-nearest neighbours graph (KNN) and the relative neighbourhood graph (RNG). The input is assumed to be a matrix where each row represents a feature, and each column, a sample. The class of each sample is also given. The method works by interactively calculating a distance matrix between samples, but this calculation is performed only considering a subset of the features that have been selected. A proximity graph is computed based on the distance matrix. The number of edges in the proximity graph connecting the nodes of different classes, the size of selected features and a score that is calculated from the computed proximity graph are then used in designing four different merit functions to measure the quality of the feature subset were investigated as scores to guide the search process. Evolutionary computation methods are then employed to select the subset of features that minimizes the number of edges of the proximity graph that are connecting nodes of different classes; other score functions have been tested as a comparison. The work also explores the use of different algorithms that are based on evolutionary computation (EC): the steady state genetic algorithm (SSGA), the generational genetic algorithm (GGA) and a memetic algorithm (MA). Four publicly available datasets are used (one also includes a training and test set independent set of samples) for evaluating different algorithms. The classification performance is evaluated using a total of 49 methods from the highly popular open source data mining and machine learning package known as WEKA; this selection helps to the reproducibility of our methods and results. In addition, the performance of the final feature selection algorithm is compared against that one of a panel of nine well-known feature selection methods. The proposed method provides an alternative to other types of feature selection methods and highlights the potential of using proximity graphs as a proxy for quality of discrimination between pairs of labelled samples in high-dimensional spaces.
- Subject
- feature selection; evolutionary algorithm; proximity graph; minimum spanning tree; k-nearest neighbours graph; relative neighbourhood graph
- Identifier
- http://hdl.handle.net/1959.13/1349840
- Identifier
- uon:30451
- Rights
- Copyright 2017 Amer Abu Zaher
- Language
- eng
- Full Text
- Hits: 5496
- Visitors: 6045
- Downloads: 817
Thumbnail | File | Description | Size | Format | |||
---|---|---|---|---|---|---|---|
View Details Download | ATTACHMENT01 | Thesis | 3 MB | Adobe Acrobat PDF | View Details Download | ||
View Details Download | ATTACHMENT02 | Abstract | 547 KB | Adobe Acrobat PDF | View Details Download |